{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mountain car climbing using A3C\n",
    "\n",
    "Let's implement the A3C algorithm for the mountain car climbing task. In the mountain car\n",
    "climbing environment, a car is placed between the two mountains and the goal of the agent\n",
    "is to drive up the mountain on the right. But the problem is, the agent can't drive up the\n",
    "mountain in one pass. So, the agent has to drive back and forth to build momentum to\n",
    "drive up the mountain on the right. A high reward will be assigned if our agent spends less\n",
    "energy on driving up. The Mountain car environment is shown in the below figure:\n",
    "\n",
    "![title](Images/2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code used in this section is adapted from the open-source implementation of A3C\n",
    "(https://github.com/stefanbo92/A3C-Continuous) provided by Stefan Boschenriedter.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.0.0\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import gym\n",
    "import multiprocessing\n",
    "import threading\n",
    "import numpy as np\n",
    "import os\n",
    "import shutil\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a clear understanding of how the A2C method works, we use\n",
    "TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow.compat.v1 as tf\n",
    "tf.disable_v2_behavior()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the mountain car environment\n",
    "\n",
    "Let's create a mountain car environment using the gym. Note that our mountain car\n",
    "environment is a continuous environment meaning that our action space is continuous:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('MountainCarContinuous-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the state shape of the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_shape = env.observation_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the action shape of the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_shape = env.action_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we created the continuous mountain car environment and thus our action space\n",
    "consists of continuous values. So, we get the bound of our action space: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_bound = [env.action_space.low, env.action_space.high]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the variables\n",
    "\n",
    "\n",
    "Now, let's define some of the important variables.\n",
    "\n",
    "Define the number of workers as the number of CPUs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_workers = multiprocessing.cpu_count() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of episodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 2000 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of time steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timesteps = 200 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the global network (global agent) scope:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_net_scope = 'Global_Net'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the time step at which we want to update the global network:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "update_global = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the discount factor, $\\gamma$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "gamma = 0.90 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the beta value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "beta = 0.01 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the directory where we want to store the logs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_dir = 'logs'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the actor critic class\n",
    "\n",
    "We learned that in A3C both the global and worker agents follow the actor critic\n",
    "architecture. So, let's define the class called ActorCritic where we will implement the\n",
    "actor critic algorithm. For a clear understanding, you can check the detailed explanation of code on the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ActorCritic(object):\n",
    "    \n",
    "     #first, let's define the init method\n",
    "     def __init__(self, scope, sess, globalAC=None):\n",
    "            \n",
    "        #initialize the TensorFlow session\n",
    "        self.sess=sess\n",
    "        \n",
    "        #define the actor network optimizer as RMS prop\n",
    "        self.actor_optimizer = tf.train.RMSPropOptimizer(0.0001, name='RMSPropA')\n",
    "        \n",
    "        #define the critic network optimizer as RMS prop\n",
    "        self.critic_optimizer = tf.train.RMSPropOptimizer(0.001, name='RMSPropC')\n",
    " \n",
    "        #if the scope is the global network (global agent)\n",
    "        if scope == global_net_scope:\n",
    "            with tf.variable_scope(scope):\n",
    "                    \n",
    "                #define the placeholder for the state\n",
    "                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\n",
    "                \n",
    "                #build the global network (global agent) and get the actor and critic parameters\n",
    "                self.actor_params, self.critic_params = self.build_network(scope)[-2:]\n",
    "      \n",
    "        #if the network is not the global network then\n",
    "        else:\n",
    "            with tf.variable_scope(scope):\n",
    "                \n",
    "                #define the placeholder for the state\n",
    "                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\n",
    "                \n",
    "                #we learned that our environment is the continuous environment, so the actor network\n",
    "                #(policy network) returns the mean and variance of the action and then we build the action\n",
    "                #distribution out of this mean and variance and select the action based on this action \n",
    "                #distribution. \n",
    "                \n",
    "                #define the placeholder for obtaining the action distribution\n",
    "                self.action_dist = tf.placeholder(tf.float32, [None, action_shape], 'action')\n",
    "                \n",
    "                #define the placeholder for the target value\n",
    "                self.target_value = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n",
    "                \n",
    "                #build the worker network (worker agent) and get the mean and variance of the action, the\n",
    "                #value of the state, and actor and critic network parameters:\n",
    "                mean, variance, self.value, self.actor_params, self.critic_params = self.build_network(scope)\n",
    "\n",
    "                #Compute the TD error which is the difference between the target value of the state and the\n",
    "                #predicted value of the state\n",
    "                td_error = tf.subtract(self.target_value, self.value, name='TD_error')\n",
    "    \n",
    "                #now, let's define the critic network loss\n",
    "                with tf.name_scope('critic_loss'):\n",
    "                    self.critic_loss = tf.reduce_mean(tf.square(td_error))\n",
    "                    \n",
    "                with tf.name_scope('wrap_action'):\n",
    "                    mean, variance = mean * action_bound[1], variance + 1e-4\n",
    "                    \n",
    "                #create a normal distribution based on the mean and variance of the action\n",
    "                normal_dist = tf.distributions.Normal(mean, variance)\n",
    "\n",
    "\n",
    "                #now, let's define the actor network loss\n",
    "                with tf.name_scope('actor_loss'):\n",
    "                    \n",
    "                    #compute the log probability of the action\n",
    "                    log_prob = normal_dist.log_prob(self.action_dist)\n",
    "         \n",
    "                    #define the entropy of the policy\n",
    "                    entropy_pi = normal_dist.entropy()\n",
    "                    \n",
    "                    #compute the actor network loss\n",
    "                    self.loss = log_prob * td_error + (beta * entropy_pi)\n",
    "                    self.actor_loss = tf.reduce_mean(-self.loss)\n",
    "       \n",
    "                #select the action based on the normal distribution\n",
    "                with tf.name_scope('select_action'):\n",
    "                    self.action = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), \n",
    "                                                   action_bound[0], action_bound[1])\n",
    "     \n",
    "        \n",
    "                #compute the gradients of actor and critic network loss of the worker agent (local agent)\n",
    "                with tf.name_scope('local_grad'):\n",
    "\n",
    "                    self.actor_grads = tf.gradients(self.actor_loss, self.actor_params)\n",
    "                    self.critic_grads = tf.gradients(self.critic_loss, self.critic_params)\n",
    " \n",
    "            #now, let's perform the sync operation\n",
    "            with tf.name_scope('sync'):\n",
    "                \n",
    "                #after computing the gradients of the loss of the actor and critic network, worker agent\n",
    "                #sends (push) those gradients to the global agent\n",
    "                with tf.name_scope('push'):\n",
    "                    self.update_actor_params = self.actor_optimizer.apply_gradients(zip(self.actor_grads,\n",
    "                                                                                        globalAC.actor_params))\n",
    "                    self.update_critic_params = self.critic_optimizer.apply_gradients(zip(self.critic_grads, \n",
    "                                                                                          globalAC.critic_params))\n",
    "\n",
    "                #global agent updates their parameter with the gradients received from the worker agents\n",
    "                #(local agents). Then the worker agents, pull the updated parameter from the global agent\n",
    "                with tf.name_scope('pull'):\n",
    "                    self.pull_actor_params = [l_p.assign(g_p) for l_p, g_p in zip(self.actor_params, \n",
    "                                                                                  globalAC.actor_params)]\n",
    "                    self.pull_critic_params = [l_p.assign(g_p) for l_p, g_p in zip(self.critic_params, \n",
    "                                                                                   globalAC.critic_params)]\n",
    "                \n",
    "\n",
    "     #let's define the function for building the actor critic network\n",
    "     def build_network(self, scope):\n",
    "            \n",
    "        #initialize the weight:\n",
    "        w_init = tf.random_normal_initializer(0., .1)\n",
    "        \n",
    "        #define the actor network which returns the mean and variance of the action\n",
    "        with tf.variable_scope('actor'):\n",
    "            l_a = tf.layers.dense(self.state, 200, tf.nn.relu, kernel_initializer=w_init, name='la')\n",
    "            mean = tf.layers.dense(l_a, action_shape, tf.nn.tanh,kernel_initializer=w_init, name='mean')\n",
    "            variance = tf.layers.dense(l_a, action_shape, tf.nn.softplus, kernel_initializer=w_init, name='variance')\n",
    "            \n",
    "        #define the critic network which returns the value of the state\n",
    "        with tf.variable_scope('critic'):\n",
    "            l_c = tf.layers.dense(self.state, 100, tf.nn.relu, kernel_initializer=w_init, name='lc')\n",
    "            value = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='value')\n",
    "        \n",
    "        actor_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n",
    "        critic_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n",
    "        \n",
    "        #Return the mean and variance of the action produced by the actor network, value of the\n",
    "        #state computed by the critic network and the parameters of the actor and critic network\n",
    "        \n",
    "        return mean, variance, value, actor_params, critic_params\n",
    "    \n",
    "     #let's define a function called update_global for updating the parameters of the global\n",
    "     #network with the gradients of loss computed by the worker networks, that is, the push operation\n",
    "     def update_global(self, feed_dict):\n",
    "        self.sess.run([self.update_actor_params, self.update_critic_params], feed_dict)\n",
    "     \n",
    "     #we also define a function called pull_from_global for updating the parameters of the\n",
    "     #worker networks by pulling from the global network, that is, the pull operation\n",
    "     def pull_from_global(self):\n",
    "        self.sess.run([self.pull_actor_params, self.pull_critic_params])\n",
    "     \n",
    "     #define a function called select_action for selecting the action\n",
    "     def select_action(self, state):   \n",
    "        state = state[np.newaxis, :]\n",
    "        return self.sess.run(self.action, {self.state: state})[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the worker class\n",
    "\n",
    "Let's define the class called Worker where we will implement the worker agent:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Worker(object):\n",
    "    \n",
    "    #first, let's define the init method:\n",
    "    def __init__(self, name, globalAC, sess):\n",
    "\n",
    "        #we learned that each worker agent works with their own copies of the environment. So,\n",
    "        #let's create a mountain car environment\n",
    "        self.env = gym.make('MountainCarContinuous-v0').unwrapped\n",
    "        \n",
    "        #define the name of the worker\n",
    "        self.name = name\n",
    "    \n",
    "        #create an object to our ActorCritic class\n",
    "        self.AC = ActorCritic(name, sess, globalAC)\n",
    "        \n",
    "        #initialize a TensorFlow session\n",
    "        self.sess=sess\n",
    "        \n",
    "    #define a function called work for the worker to learn:\n",
    "    def work(self):\n",
    "        global global_rewards, global_episodes\n",
    "        \n",
    "        #initialize the time step\n",
    "        total_step = 1\n",
    "     \n",
    "        #initialize a list for storing the states, actions, and rewards\n",
    "        batch_states, batch_actions, batch_rewards = [], [], []\n",
    "        \n",
    "        #when the global episodes are less than the number of episodes and coordinator is active\n",
    "        while not coord.should_stop() and global_episodes < num_episodes:\n",
    "            \n",
    "            #initialize the state by resetting the environment\n",
    "            state = self.env.reset()\n",
    "            \n",
    "            #initialize the return\n",
    "            Return = 0\n",
    "            \n",
    "            #for each step in the environment\n",
    "            for t in range(num_timesteps):\n",
    "                \n",
    "                #render the environment of only the worker 0:\n",
    "                if self.name == 'W_0':\n",
    "                    self.env.render()\n",
    "                    \n",
    "                #select the action\n",
    "                action = self.AC.select_action(state)\n",
    "                \n",
    "                #perform the selected action\n",
    "                next_state, reward, done, _ = self.env.step(action)\n",
    "                \n",
    "                #set done to true if we reached the final step of the episode else set to false\n",
    "                done = True if t == num_timesteps - 1 else False\n",
    "                \n",
    "                #update the return\n",
    "                Return += reward\n",
    "                \n",
    "                #store the state, action, and reward into the lists\n",
    "                batch_states.append(state)\n",
    "                batch_actions.append(action)\n",
    "                batch_rewards.append((reward+8)/8)\n",
    "    \n",
    "                #now, let's update the global network. If done is true then set the value of next state to 0 else\n",
    "                #the compute the value of the next state\n",
    "                if total_step % update_global == 0 or done:\n",
    "                    if done:\n",
    "                        v_s_ = 0\n",
    "                    else:\n",
    "                        v_s_ = self.sess.run(self.AC.value, {self.AC.state: next_state[np.newaxis, :]})[0, 0]\n",
    " \n",
    "                    batch_target_value = []\n",
    "                    \n",
    "                    #compute the target value which is sum of reward and discounted value of next state\n",
    "                    for reward in batch_rewards[::-1]:\n",
    "                        v_s_ = reward + gamma * v_s_\n",
    "                        batch_target_value.append(v_s_)\n",
    "\n",
    "                    #reverse the target value\n",
    "                    batch_target_value.reverse()\n",
    "                    \n",
    "                    #stack the state, action and target value\n",
    "                    batch_states, batch_actions, batch_target_value = np.vstack(batch_states), np.vstack(batch_actions), np.vstack(batch_target_value)\n",
    "                    \n",
    "                    #define the feed dictionary\n",
    "                    feed_dict = {\n",
    "                                 self.AC.state: batch_states,\n",
    "                                 self.AC.action_dist: batch_actions,\n",
    "                                 self.AC.target_value: batch_target_value,\n",
    "                                 }\n",
    "                    \n",
    "                    #update the global network\n",
    "                    self.AC.update_global(feed_dict)\n",
    "                    \n",
    "                    #empty the lists:\n",
    "                    batch_states, batch_actions, batch_rewards = [], [], []\n",
    "                    \n",
    "                    #update the worker network by pulling the parameters from the global network:\n",
    "                    self.AC.pull_from_global()\n",
    "                    \n",
    "                #update the state to the next state and increment the total step:\n",
    "                state = next_state\n",
    "                total_step += 1\n",
    "                \n",
    "                #update global rewards:\n",
    "                if done:\n",
    "                    if len(global_rewards) < 5:\n",
    "                        global_rewards.append(Return)\n",
    "                    else:\n",
    "                        global_rewards.append(Return)\n",
    "                        global_rewards[-1] =(np.mean(global_rewards[-5:]))\n",
    "                    \n",
    "                    global_episodes += 1\n",
    "                    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the network\n",
    "\n",
    "Now, let's start training the network. Initialize the global rewards list and also initialize the\n",
    "global episodes counter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_rewards = []\n",
    "global_episodes = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start the TensorFlow session:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess = tf.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with tf.device(\"/cpu:0\"):\n",
    "    \n",
    "    #create a global agent\n",
    "    global_agent = ActorCritic(global_net_scope,sess)\n",
    "    worker_agents = []\n",
    "    \n",
    "    #create n number of worker agent:\n",
    "    for i in range(num_workers):\n",
    "        i_name = 'W_%i' % i\n",
    "        worker_agents.append(Worker(i_name, global_agent,sess))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the TensorFlow coordinator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "coord = tf.train.Coordinator()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize all the TensorFlow variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess.run(tf.global_variables_initializer())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Store the TensorFlow computational graph to the log directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "if os.path.exists(log_dir):\n",
    "    shutil.rmtree(log_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.summary.writer.writer.FileWriter at 0x7ff8025eea20>"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.summary.FileWriter(log_dir, sess.graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, run the worker threads:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "worker_threads = []\n",
    "for worker in worker_agents:\n",
    "\n",
    "    job = lambda: worker.work()\n",
    "    thread = threading.Thread(target=job)\n",
    "    thread.start()\n",
    "    worker_threads.append(thread)\n",
    "\n",
    "\n",
    "coord.join(worker_threads)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a better understanding of A3C architecture, let's take a look at the computational graph\n",
    "of A3C in the next section. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
